Hands-on RNA-seq Analysis from fastq

Sheffield Bioinformatics Core

web : sbc.shef.ac.uk
twitter: [@SheffBioinfCore](https://twitter.com/SheffBioinfCore)
email: bioinformatics-core@sheffield.ac.uk


Tutorial Overview

This tutorial will cover the basics of RNA-seq using Galaxy and Degust; two open-source web-based platforms for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.

Using these notes

Sections with this background indicate exercises to be completed during the workshop.

Sections with this background highlight particular shortcuts or other references that might be useful.

Sections with this background give information about potential error messages or might encounter, or problems that might arise in your own data analysis.

More on Galaxy

The official Galaxy page has many tutorials on using the service, and examples of other types of analysis that can be performed on the platform.

Those eventually wanted to perform their own RNA-seq analysis (for example in R), should look out for other courses

Courses on analysing RNA-seq data in R

RNA-seq workflow

Two workflows are possible with RNA-seq data - with the difference being whether one performs an alignment to the reference genome or not.

Recent tools for RNA-seq analysis (e.g. salmon, kallisto) do not require the time-consuming step of whole-genome alignment to be performed, and can therefore produce gene-level counts in a much faster time frame. They not require the creation of large bam files, which is useful if constrained by file space on Galaxy.

(image from Harvard Bioinformatics Core)


Background

Where do the data in this tutorial come from?

The data for this tutorial comes from a Journal of Experimental Medicine paper “Itraconazole targets cell cycle heterogeneity in colorectal cancer”. This study examines the expression profiles of two cell lines in response to treatment with itraconazole.

For this tutorial, we will assume that the wet-lab stages of the experiment have been performed and we are now in the right-hand branch of the workflow. In this tutorial we will demonstrate the steps of Quality assessment, alignment, quantification and differential expression testing.

The fastq data for this experiment were made available on the Sequencing Read Archive (SRA) with accession SRP144496. For the purposes of this workshop we have created a downsampled dataset

The experimental design for the dataset is summarised in the table below.

run name cell_line condition
SRR7108388 HT55_CONT_1 HT55 DMSO
SRR7108389 HT55_CONT_2 HT55 DMSO
SRR7108390 HT55_CONT_3 HT55 DMSO
SRR7108391 HT55_CONT_4 HT55 DMSO
SRR7108392 HT55_ITRA_1 HT55 ITRACONAZOLE
SRR7108393 HT55_ITRA_2 HT55 ITRACONAZOLE
SRR7108394 HT55_ITRA_3 HT55 ITRACONAZOLE
SRR7108395 HT55_ITRA_4 HT55 ITRACONAZOLE
SRR7108396 SW948_CONT_1 SW948 DMSO
SRR7108397 SW948_CONT_2 SW948 DMSO
SRR7108398 SW948_CONT_3 SW948 DMSO
SRR7108399 SW948_CONT_4 SW948 DMSO
SRR7108400 SW948_ITRA_1 SW948 ITRACONAZOLE
SRR7108401 SW948_ITRA_2 SW948 ITRACONAZOLE
SRR7108402 SW948_ITRA_3 SW948 ITRACONAZOLE
SRR7108403 SW948_ITRA_4 SW948 ITRACONAZOLE

Digression: How to download raw sequencing files

The Sequencing Read Archive (SRA) is commonly-used to store the raw data from sequencing experiements and can be accessed through the NCBI website. However, the interface is not particularly friendly and the links to download data and not easy to obtain.

An easier alternative exists in the form of SRA Explorer

The SRA accession (usually found in a paper describing the dataset) can be entered into the Search box, and all the samples belonging to that dataset should be found. Samples of interest can be saved, and upon “checkout” the download links (URLs) will be displayed. A command-line tool such as curl or wget can then be used to download the files locally.

Section 1: Preparation

1. Sign-up to the European Galaxy server

Make sure you check your email to activate your account

Now, use this link to access a special queue. This link will only be active on December 6th.

2. Download the course data

The data for this course have all been shared on a google drive. If you have not done so already, please go to this directory and download the following files

https://drive.google.com/drive/folders/1RSuvl9shAw12Bj77uYSUdWtkZ5ST5EWi?usp=sharing

  • SRR7108388.fastq.gz
  • SRR7108389.fastq.gz
  • SRR7108392.fastq.gz
  • SRR7108393.fastq.gz
  • Homo_sapiens.GRCh38.cdna.all.fa.gz
  • Homo_sapiens.GRCh38.104.gtf.gz
  • tx2gene.txt

3. Import the RNA-seq data for the workshop.

We can going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. To make the practical quicker, we have downsampled the original fastq files to a quarter of a million reads.

Get Data -> Upload File

You can import the data by:

  1. In the tool panel located on the left, under Basic Tools select Get Data > Upload File. Click on the Choose local file button on the bottom section of the pop-up window.
  2. Navigate to the fastq directory of the zip file that you downloaded from google drive and select these two files from the HT55-DMSO condition.

SRR7108388.fastq.gz SRR7108389.fastq.gz

and these two files are from the HT55 ITRACONAZOLE condition.

SRR7108392.fastq.gz SRR7108393.fastq.gz

also upload the files Homo_sapiens.GRCh38.cdna.all.fa.gz, Homo_sapiens.GRCh38.104.gtf.gz and tx2gene.txt. These are reference files that we will use later.

  1. You should now have these 4 fastq files in your history:
    • SRR7108388.fastq.gz
    • SRR7108389.fastq.gz
    • SRR7108392.fastq.gz
    • SRR7108393.fastq.gz

The annotation files may take a while longer to upload

The .gz at the end of each file name means that it is compressed (like a zip file).

You can upload the other files for extra practice if you wish

Quality assessment with FastQC (Optional)

FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user’s attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.

FastQ Quality Control -> FastQC Read Quality reports

  • Select one of the FASTQ files as input and Execute the tool.
  • When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.
  • Run Fastqc on the remaing fastq files, but don’t examine the results just yet.

Question: Do the data seem to be of reasonable quality?

You can use the documentation to help interpret the plots

If poor quality reads towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to trim our data.

However, a recent paper demonstrated that read trimming is no longer required prior to alignment:- https://www.biorxiv.org/content/10.1101/833962v1

If you also suspect contamination by another organism, or rRNA present in your data, you can use the sortMeRNA tool to remove this artefact.

Combining QC reports

It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.

The multiqc tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.

FASTQ Quality Control -> Multiqc

Under Which tool was used generate logs? Choose fastqc and select the RawData output from the fastqc run on each of your bam files.

Question: Repeat the FastQC analysis for the remaining fastq files and combine the reports with multiQC. Do the fastq files seem to have consistently high-quality?

Section 2: Quantification

We can align our RNA-seq reads to a reference genome, and then overlap with know gene coordinates, but many prefer to align directly to the transcriptome sequences using a method such as salmon or kallisto. We will demonstrate the salmon protocol.

Obtaining the files for salmon

  1. cDNA fasta file

This file has been provided in the google drive folder and should have been uploaded to your Galaxy history.

However, it is useful to know where this file came from in case you are not working with Human data. The file was obtained from Ensembl by clicking on the cDNA (FASTA) link for the appropriate organism (Human).

On the next screen, Right-click to save the .cdna.all.fa.gz to your computer

  1. Transcript mapping file

By default, salmon will produce counts for each transcript. This might be what we want, but for most standard analyses it is preferable to work at the gene-level. We therefore have to tell salmon how the transcripts in the cDNA file relate to known genes. Such a file can be obtained from biomart.

  • Select the Ensembl genes (104) database
  • Select the dataset Human genes (GRCh38.13)
  • In Attributes, click the “+” button next to GENE and select Transcript stable ID version and Gene stable ID. Make sure the order on the left-hand panel is Transcript stable ID version, followed by Gene stable ID. This will affect the column order in the file. You can tick / untick the IDs to make sure the order is correct.

  • Click the Results button in the top left corner to see a preview of the results. Clicking the Go button will export the results to a file

It is important that the file downloaded from Biomart is edited so that the column headings do not contain any spaces. You can do this in a text editor. The edited file should look like this.

It is important to make sure the version number of your transcript file and the biomaRt dataset are the same, otherwise some of the steps downstream might not work as expected.

If you have problems, this mapping file is also provided in the google drive as tx2gene.txt.

  1. Annotation file (optional)

The Ensembl gene IDs are not particularly memorable, so it would be highly beneficial to have other annotations at hand to help us interpret the data downstream. We can use the biomart website again to produce a table to downstream intrepretation.

This time, select only the Gene Stable ID tickbox in the GENE box. Expand the EXTERNAL panel by clicking the “+” next to EXTERNAL, and select HGNC symbol and NCBI gene (formerly Entrezgene) ID

salmon configuration and running

RNA Analysis -> Salmon quant

  • Select the Homo_sapiens_GRCh38.cdna.all.fa.gz file as the Transcripts fasta file
  • Select all your uploaded fastq files as your Data Input FASTQ/FASTA file
  • Scroll down to File containing a mapping of transcripts to genes and select the tx2gene.txt file

Two jobs will now be queued for each sample fastq file. The Quantification output will contain transcript-level data, and the Gene Quantification output will be at the gene-level. We should expect the number of lines in the Gene Quantification file to be substantially less. If not, you will need to check that your transcript mapping file was correct.

The Gene Quantification output from each sample comprises the following columns (taken from the salmon documentation)

  • Name — This is the name of the target transcript provided in the input transcript database (FASTA file).
  • Length — This is the length of the target transcript in nucleotides.
  • EffectiveLength — This is the computed effective length of the target transcript. It takes into account all factors being modeled that will effect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and gc-fragment bias (if they are being modeled).
  • TPM — This is salmon’s estimate of the relative abundance of this transcript in units of Transcripts Per Million (TPM). TPM is the recommended relative abundance measure to use for downstream analysis.
  • NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

Note that we are using a downsampled dataset, so the majority of NumReads will be zero.

Create a count matrix

Methods for detecting differential expression are likely to want data in the form of a table; where every row is a different gene and each column is a unique biological sample. Before we can proceed we will therefore need to merge our salmon results into a single output. This can be down using the Salmon quantmerge tool

RNA Analysis -> Salmon quantmerge

Use the +Insert Quant file and names button repeatedly to select each of your Gene Quantification outputs. The One-word sample names text box can be used to create a shorter column name for each output.

Once all the Gene Quantification files have been selected the drop-down menu under Columns should be changed from Length to NumReads.

After the tool has finished you should have a table with

Adding extra annotation to results

Text Manipulation -> Join two files

  • 1st file: Column Join on data….
  • Column to use from 1st file: Column 1
  • 2nd file: result from annotateMyIDs on data…
  • Column to use from 2nd file: Column 1

(Optional) Alternative workflow involving genome alignment

If time allows, we will also follow this section

The workflow that people used for many years is summarised in this image from Ting-you Wang’s RNA-seq data analysis page, and may still be preferable if your analysis doesn’t just call for gene-level counts.

Mapping -> HISAT2

1. Map/align the reads with HISAT2 to the hg38 reference genome

In the left tool panel menu, under NGS Analysis, select Mapping > HISAT2 and set the parameters as follows:

  • Is this single-end or paired-end data? Single-end (as individual datasets)
  • FASTQ file
    (Click on the multiple datasets icon and select all four of the FASTQ files)
    • SRR7108388.fastq.gz
    • SRR7108389.fastq.gz
    • SRR7108392.fastq.gz
    • SRR7108393.fastq.gz
  • Source for the reference genome Use built-in genome
  • Select a reference genome: Human Dec 2013. (GRCh38/hg38) (hg38)
  • Use defaults for the other fields
  • Execute

Quantification (Counting reads in features)

In order to test for differential expression, we need to count up how many times each “feature” is observed in each sample. The goal of such operations is to produce a count table such as that shown below. We can then apply statistical tests to these data

HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.

Various rules can be used to assign counts to features

To obtain the coordinates of each gene, we can use the UCSC genome browser which is integrated into Galaxy.

Obtaining gene coordinates

Ensembl

Counting reads in genes

RNA Analysis > htseq-count

  1. Use HTSeq-count to count the number of reads for each feature.
    In the left tool panel menu, under NGS Analysis, select NGS Analysis > htseq-count and set the parameters as follows:
    • Aligned SAM/BAM file
      (Select one of four bam files, or all four using the multiple datasets option)
    • GFF file UCSC Main on Mouse:ncbiRefSeq (genome)
    • Use defaults for the other fields
    • Execute
  2. Repeat for the remaining bam files if running on each bam separately.
  3. To make things easier to track, rename the ht-seq output for each sample to contain the corresponding sample name (e.g. SRR1552444.htseq). Do not rename the outputs that have “(no feature)” in their name

Create a count matrix

The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools such as Degust (see next section) which require the counts to be presented in a data matrix where each row is a gene and each column is a particular sample in the dataset.

Collection Operations -> Column Join on Collections

  • In the Tabular Files section, select the ht-seq count files from your history SRR1552444.htseq, SRR1552450, etc… Holding the CTRL key allows multiple files to be selected
  • Keep Identifier column as 1

Differential Expression using Degust

Differential expression is possible using Galaxy using the DESeq2 tool (for example). However, our particular recommendation is to use Degust for a more interactive experience. For this section, we will be using counts generated on the full dataset, rather than the downsampled data analysed in the previous section. These counts are available in the file GSE114013_salmon_counts.csv.

Differential expression

The term differential expression was first used to refer to the process of finding statistically significant genes from a microarray gene expression study.

Such methods were developed on the premise that microarray expression values are approximately normally-distributed when appropriately transformed (e.g. by using a log\(_2\) transformation) so that a modified version of the standard t-test can be used. The same test is applied to each gene under investigation yielding a test statistic, fold-change and p-value. Similar methods have been adapted to RNA-seq data to account for the fact that the data are count-based and do not follow a normal distribution.

http://degust.erc.monash.edu/

Degust is a web tool that can analyse the counts files produced in the step above, to test for differential gene expression. It offers and interactive view of the differential expression results

The input file is a count matrix where each row is a measured gene, and each column is a different biological sample. Within the tool we can configure which samples belong to the different biological groups of interest.

Read counts have to be normalised first prior to differential expression testing. There are two main biases that need to be accounted for:-

  • size of gene
    • longer genes will have more reads assigned to them
  • library size
    • a sample that is sequenced to a higher depth will receive more reads

However, R-based methods such as edgeR (implemented in Degust) and DESeq2 have their own method of normalising counts. You will probably encounter other methods of normalising RNA-seq reads such as RPKM, CPM, TPM etc. This blog provides a nice explanation of the current thinking. As part of the Degust output, you have the option of downloading normalised counts in various formats. Some other online visualisation tools require normalised counts as input, so it is good to have these to-hand.

Uploading the count matrix to Degust

  • From the main degust page, click Upload your counts file
  • Click on Browse
  • Select the location of the file GSE114013_salmon_counts.csv, and click Open.
  • Click Upload
  • A Configuration page will appear.

  • For Name type “GSE114013” (or whatever you want to call the analysis)
  • For Info columns select SYMBOL
  • Click Add condition
    • Referring to the experiment design (below), select the DMSO samples and call the condition DMSO
    • Repeat for the ITRACONAZOLE samples
  • Save the settings and then View the results
run name cell_line condition
SRR7108388 HT55_CONT_1 HT55 DMSO
SRR7108389 HT55_CONT_2 HT55 DMSO
SRR7108390 HT55_CONT_3 HT55 DMSO
SRR7108391 HT55_CONT_4 HT55 DMSO
SRR7108392 HT55_ITRA_1 HT55 ITRACONAZOLE
SRR7108393 HT55_ITRA_2 HT55 ITRACONAZOLE
SRR7108394 HT55_ITRA_3 HT55 ITRACONAZOLE
SRR7108395 HT55_ITRA_4 HT55 ITRACONAZOLE
SRR7108396 SW948_CONT_1 SW948 DMSO
SRR7108397 SW948_CONT_2 SW948 DMSO
SRR7108398 SW948_CONT_3 SW948 DMSO
SRR7108399 SW948_CONT_4 SW948 DMSO
SRR7108400 SW948_ITRA_1 SW948 ITRACONAZOLE
SRR7108401 SW948_ITRA_2 SW948 ITRACONAZOLE
SRR7108402 SW948_ITRA_3 SW948 ITRACONAZOLE
SRR7108403 SW948_ITRA_4 SW948 ITRACONAZOLE

Overview of Degust sections

  • Top black panel with Configure settings at right.
  • Left: Conditions: DMSO and ITRACONAZOLE.
  • Top centre: Plots, with options at right.
  • When either of the expression plots are selected, a heatmap appears below.
  • A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).

(Not that the screenshots are for illustration purposes and taken from a different dataset to that being analysed in the tutorial)

MA-plot

Each dot shows the change in expression in one gene.

  • The average expression (over both condition and treatment samples) is represented on the x-axis.
    • Plot points should be symmetrical around the x-axis.
    • We can see that many genes are expressed at a low level, and some are highly expressed.
  • The fold change is represented on the y axis.
    • If expression is significantly different between batch and chem, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
    • At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.

Click on the dot to see the gene name.

Parallel coordinates and heatmap

Each line shows the change in expression in one gene, between control and treatment.

  • Go to Options at the right.
    • For FDR cut-off set at 0.001.
    • This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
  • Look at the Parallel Coordinates plot. There are two axes:
    • Left: Control: Gene expression in the control samples. All values are set at zero.
    • Right: Treatment Gene expression in the treatment samples, relative to expression in the control.
  • The blocks of blue and red underneath the plot are called a heatmap.
    • Each block is a gene. Click on a block to see its line in the plot above.
    • Look at the row for the chem. Relative to batch, genes expressed more are red; genes expressed less are blue.

Table of genes

Table of genes

  • gene_id: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
  • FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
  • basal and luminal: log2(Fold Change) of gene expression. The default display is of fold change in the treatment relative to the control. Therefore, values in the batch column are zero. This can be changed in the Options panel at the top right.
    • In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.

The table can be sorted according to any of the columns (e.g. fold-change or p-value)

Download and R code

Above the genes table is the option to download the results of the current analysis to a csv file. You can also download the R code required to reproduce the analysis by clicking the Show R code box underneath the Options box.

Plots such as the MDS, MA and heatmap can also be exported by right-clicking on the plot.

MDS plot

This is a multidimensional scaling plot which represents the variation between samples. It is a similar concept to a Principal Components Analysis (PCA) plot. The x-axis is the dimension with the highest magnitude. In a standard control/treatment setup, samples should be split along this axis. A desirable plot is shown below:-

Exercise

Question: It seems that the differential expression analysis is detecting lots of genes. However, does this tell the whole story about the dataset? What do you think is the main factor separating samples on the x-axis, and thus explaining the most variation in the data?

Modifying the analysis

We will now repeat the analysis, but only for samples from the HT55 cell-line. The correct configuration for this analysis is shown below:-

Exercise: How many genes are differentially-expressed with an FDR < 0.05 and abs logFC > 1. Download this file and rename it to HT55.ITRACONAZOLE_vs_DMSO.csv.

Exercise: Rest the FDR cut-off and abs LogFC cutoffs back to default and download the file and rename to background.csv. We will use this later.

Exercise: Repeat the analysis for SW948 samples and download the table of differentially-expressed results (same FDR and log fold-change) to SW948.ITRACONAZOLE_vs_DMSO.csv

File Downloads

If you didn’t manage to complete these analyses, you can download the files from here by right-clicking on each link and selecting “Save Link as” (or equivalent). They are also available in the course google drive.

Overlapping Gene Lists

We might sometimes want to compare the lists of genes that we identify using different methods, or genes identified from more than one contrast. In our example dataset we can compare the genes in the contrast of ITRACONAZOLE vs DMSO in HT55 and SW948 cells

The website venny provides a really nice interface for doing this.

  • Open both your HT55: ITRACONAZOLE vs DMSO and SW948: ITRACONAZOLE vs DMSO results files in Excel
  • Go to the venny website
  • Copy the names of genes with adjusted p-value less than 0.05 in the HT55 analysis into the List 1 box on the venny website. List 1 can be renamed to HT55
    • You can select all entries in a column with the shortcut CTRL + SPACE
  • Copy the names of genes with adjusted p-value less than 0.05 in the SW948 analysis into the List 2 box on the venny website. List 2 can be renamed to SW948
  • venny should now report the number of genes found in each list, the size of the intersection, and genes unique to each method

Refined analysis

The final analysis we will perform is to include all the samples, but correct for the differences in cell-line. This is achieved by telling Degust about the hidden factors in our dataset. The hidden factor in this dataset is whether the sample is from the HT55 or SW948 samples. However, we only need to specify which samples are from HT55. Other hidden factors you might need to include could be (depending on the MDS plot) :-

  • sample batch
  • gender

See below for the correct configuration to include the hidden factors.

Enrichment and pathways analysis

In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.

In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.

There are two different approaches one might use, and we will cover the theory behind both. The distinction is whether you are happy to use a hard (and arbitrary) threshold to identify DE genes.

Over-representation analysis

“Threshold-based” methods require defintion of a statistical threshold to define list of genes to test (e.g. FDR < 0.01). Then a hypergeometric test or Fisher’s Exact test generally used. They are typically used in situations where plenty of DE genes have been identified, and people often use quite relaxed criteria for identifying DE genes (e.g. raw rather than adjusted p-values or FDR value)

The question we are asking here is;

“Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”

We can answer this question by knowing

  • the total number of DE genes
  • the number of genes in the gene set (pathway or process)
  • the number of genes in the gene set that are found to be DE
  • the total number of tested genes (background)

The formula for Fishers exact test is;

\[ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]

with:-

is DE Not DE Row Total
In Gene Set a b a + b
Not in Gene Set c d c + d
Column Total a + c b + d a + b + c + d = n

In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- http://www.geneontology.org/

Using GOrilla

There are several popular online tools for performing enrichment analysis

We will be using the online tool GOrilla to perform the pathways analysis as it is particularly straightforward. It has two modes; the first of which accepts a list of background and target genes.

  1. Go to http://cbl-gorilla.cs.technion.ac.il/
  2. Choose Organism: Homo Sapiens
  3. Choose running mode: Two unranked lists of genes
  4. Paste the gene symbols corresponding to DE genes in SW948: ITRACONAZOLE vs DMSO into the Target set.
  • The shortcut CTRL + SPACE will let you select an entire column
  1. Paste the gene symbols from the Background set into the other box. This should be the names of all genes present in the Background file. i.e. all genes that were tested.
  2. Choose an Ontology: Process
  3. Under advanced options, tick Output results in Microsoft Excel format
  4. Search Enriched GO terms

You should be presented with a graph of enriched GO terms showing the relationship between the terms. Each GO term is coloured according to its statistical significance.

Below the figure is the results table. This links to more information about each GO term, and lists each gene in the category that was found in your list. The enrichment column gives 4 numbers that are used to determine enrichment (similar to the Fisher exact test we saw earlier)

  • N, total number of genes (should be the same in all rows)
  • B, total number of genes annotated with the GO term
  • n, total number of genes that were found in the list you uploaded (same for all rows)
  • b, number of genes in the list you uploaded that intersect with this GO term

Exercise: Repeat the GOrilla to find enriched pathways in the HT55: ITRACONAZOLE vs DMSO analysis. What do you notice?

Threshold-free analysis

This type of analysis is popular for datasets where differential expression analysis does not reveal many genes that are differentially-expressed on their own. Instead, it seeks to identify genes that as a group have a tendancy to be near the extremes of the log-fold changes. The results are typically presented in the following way.

As such, it does not rely on having to impose arbitrary cut-offs on the data. Instead, we need to provide a measure of the importance of each gene such as it’s fold-change. These are then used the rank the genes.

The Broad institute has made this analysis method popular and provides a version of GSEA that can be run via a java application. However, the application can be a bit fiddly to run, so we will use the GeneTrail website instead

  • Open the file background.csv in Excel and delete all columns except the SYMBOL and ITRA column.
  • Go to the GeneTrail website, and select Transcriptomics from the front page
  • Select the Paste the content of a text file in a tabular format option and the contents of your modified excel file into the box. Do not paste the column headings
  • Click Upload

Hopefully it should recognise your input without any errors, and on the next screen the Set-level statistic should be automatically set to GSEA

If your data does not get uploaded, double-check that the column heading ITRA has not been pasted into the text box

To make the analysis run fast, you can de-select the GO pathways (biological processes, molecular function and cellular compartment)

---
title: "RNA-seq tutorial"
author: "Mark Dunning"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
  html_document:
    df_print: paged
    toc: yes
editor_options:
  chunk_output_type: inline
---



# Hands-on RNA-seq Analysis from fastq


### Sheffield Bioinformatics Core
<img src="media/logo-sm.png" align=right>

web : [sbc.shef.ac.uk](https://sbc.shef.ac.uk)  
twitter: [@SheffBioinfCore](https://twitter.com/SheffBioinfCore)  
email: [bioinformatics-core@sheffield.ac.uk](bioinformatics-core@sheffield.ac.uk)

-----

## Tutorial Overview

This tutorial will cover the basics of RNA-seq using Galaxy and Degust; two open-source web-based platforms for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.


## Using these notes

<div class="exercise">
Sections with this background indicate exercises to be completed during the workshop.
</div>

<div class="information">
Sections with this background highlight particular shortcuts or other references that might be useful.
</div>

<div class="warning">
Sections with this background give information about potential error messages or might encounter, or problems that might arise in your own data analysis.
</div>

### More on Galaxy

The official Galaxy page has many [tutorials](https://galaxyproject.org/learn/) on using the service, and examples of other types of analysis that can be performed on the platform.

Those eventually wanted to perform their own RNA-seq analysis (for example in R), should look out for other courses

### Courses on analysing RNA-seq data in R

- [Sheffield Bioinformatics Core](https://sbc.shef.ac.uk/training/rna-seq-in-r-2020-02-13/)
- [Monash Bioinformatics Platform](http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/)


## RNA-seq workflow

Two workflows are possible with RNA-seq data - with the difference being whether one performs an alignment to the reference genome or not.

Recent tools for RNA-seq analysis (e.g. `salmon`, `kallisto`) do not require the time-consuming step of whole-genome alignment to be performed, and can therefore produce gene-level counts in a much faster time frame. They not require the creation of large bam files, which is useful if constrained by file space on Galaxy.

![](https://hbctraining.github.io/Intro-to-rnaseq-hpc-gt/img/alignmentfree_workflow_aug2017.png)

(image from Harvard Bioinformatics Core)


-----

## Background

#### Where do the data in this tutorial come from?
The data for this tutorial comes from a Journal of Experimental Medicine paper ["Itraconazole targets cell cycle heterogeneity in colorectal cancer"](https://pubmed.ncbi.nlm.nih.gov/29853607/). This study examines the expression profiles of two cell lines in response to treatment with itraconazole.



For this tutorial, we will assume that the *wet-lab* stages of the experiment have been performed and we are now in the right-hand branch of the workflow. In this tutorial we will demonstrate the steps of **Quality assessment**, **alignment**, **quantification** and **differential expression testing**.

The fastq data for this experiment were made available on the Sequencing Read Archive (SRA) with accession SRP144496. **For the purposes of this workshop we have created a downsampled dataset**

The experimental design for the dataset is summarised in the table below.

run	| name	| cell_line	| condition
----|-------|-----------|--------- 
SRR7108388	| HT55_CONT_1	| HT55	| DMSO
SRR7108389	| HT55_CONT_2	| HT55	| DMSO
SRR7108390	| HT55_CONT_3	| HT55	| DMSO
SRR7108391	| HT55_CONT_4	| HT55	| DMSO
SRR7108392	| HT55_ITRA_1	| HT55	| ITRACONAZOLE
SRR7108393	| HT55_ITRA_2	| HT55	| ITRACONAZOLE
SRR7108394	| HT55_ITRA_3	| HT55	| ITRACONAZOLE
SRR7108395	| HT55_ITRA_4	| HT55	| ITRACONAZOLE
SRR7108396	| SW948_CONT_1 | SW948	| DMSO
SRR7108397	| SW948_CONT_2 | SW948	| DMSO
SRR7108398	| SW948_CONT_3 | SW948	| DMSO
SRR7108399	| SW948_CONT_4 | SW948	| DMSO
SRR7108400	| SW948_ITRA_1 | SW948	| ITRACONAZOLE
SRR7108401	| SW948_ITRA_2 | SW948	| ITRACONAZOLE
SRR7108402	| SW948_ITRA_3 | SW948	| ITRACONAZOLE
SRR7108403	| SW948_ITRA_4 | SW948	| ITRACONAZOLE

#### Digression: How to download raw sequencing files

The Sequencing Read Archive (SRA) is commonly-used to store the raw data from sequencing experiements and can be accessed through the NCBI website. However, the interface is not particularly friendly and the links to download data and not easy to obtain.

An easier alternative exists in the form of SRA Explorer

- [https://sra-explorer.info/](https://sra-explorer.info/)

<img src="media/sra_explorer.png"/>

The SRA accession (usually found in a paper describing the dataset) can be entered into the Search box, and all the samples belonging to that dataset should be found. Samples of interest can be saved, and upon "checkout" the download links (URLs) will be displayed. A command-line tool such as `curl` or `wget` can then be used to download the files locally.

## Section 1: Preparation
#### 1. Sign-up to the European Galaxy server


- https://usegalaxy.eu
- The Australian server is an alternative if the European one is down:- https://usegalaxy.org.au/


**Make sure you check your email to activate your account**

Now, use this link to access a special queue. This link will only be active on **December 6th**.

- [https://usegalaxy.eu/join-training/sbcgalaxy-2021-12-01](https://usegalaxy.eu/join-training/sbcgalaxy-2021-12-01)

#### 2. Download the course data

The data for this course have all been shared on a google drive. If you have not done so already, please go to this directory and download the following files

https://drive.google.com/drive/folders/1RSuvl9shAw12Bj77uYSUdWtkZ5ST5EWi?usp=sharing

- `SRR7108388.fastq.gz`
- `SRR7108389.fastq.gz`
- `SRR7108392.fastq.gz`
- `SRR7108393.fastq.gz`
- `Homo_sapiens.GRCh38.cdna.all.fa.gz`
- `Homo_sapiens.GRCh38.104.gtf.gz`
- `tx2gene.txt`


#### 3.  Import the RNA-seq data for the workshop.

We can going to import the [*fastq* files](https://en.wikipedia.org/wiki/FASTQ_format) for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. To make the practical quicker, we have *downsampled* the original fastq files to a quarter of a million reads.



<div class="information">

#### **Get Data -> Upload File ** 

</div>

You can import the data by:

1.  In the tool panel located on the left, under Basic Tools select **Get Data > Upload File**. Click on the **Choose local file** button on the
    bottom section of the pop-up window.
2.  Navigate to the `fastq` directory of the zip file that you downloaded from google drive and select these two files from the HT55-DMSO condition. 

`SRR7108388.fastq.gz`
`SRR7108389.fastq.gz`
 
 and these two files are from the HT55 ITRACONAZOLE condition.

`SRR7108392.fastq.gz`
`SRR7108393.fastq.gz`

also upload the files `Homo_sapiens.GRCh38.cdna.all.fa.gz`, `Homo_sapiens.GRCh38.104.gtf.gz` and `tx2gene.txt`. These are reference files that we will use later.

3.  You should now have these 4 fastq files in your history:
    - `SRR7108388.fastq.gz`
    - `SRR7108389.fastq.gz`
    - `SRR7108392.fastq.gz`
    - `SRR7108393.fastq.gz`

The annotation files may take a while longer to upload

<div class="information">
The `.gz` at the end of each file name means that it is *compressed* (like a zip file). 
</div>

<div class="information">
You can upload the other files for extra practice if you wish
</div>

### Quality assessment with FastQC (Optional)

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a popular tool from [Babraham Institute Bioinformatics Group](https://www.bioinformatics.babraham.ac.uk/index.html) used for *quality assessment* of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The [documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/) for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user's attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.

<div class="information">

#### *FastQ Quality Control* -> *FastQC Read Quality reports*

</div>

- Select one of the FASTQ files as input and *Execute* the tool.
- When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.
- Run Fastqc on the remaing fastq files, but don't examine the results just yet.


<div class="exercise">

**Question: Do the data seem to be of reasonable quality? **

You can use the [documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/) to help interpret the plots

</div>

If poor quality reads towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to *trim* our data.

However, a recent paper demonstrated that read trimming is no longer required prior to alignment:- https://www.biorxiv.org/content/10.1101/833962v1


<div class="warning">

If you also suspect contamination by another organism, or rRNA present in your data, you can use the sortMeRNA tool to remove this artefact.
</div>


### Combining QC reports

It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.

The [multiqc](https://multiqc.info/) tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.


<div class="information">

*FASTQ Quality Control* -> *Multiqc*

</div>

Under *Which tool was used generate logs?* Choose *fastqc* and select the RawData output from the fastqc run on each of your bam files.


<div class="exercise">

Question: Repeat the FastQC analysis for the remaining fastq files and combine the reports with `multiQC`. Do the fastq files seem to have consistently high-quality?
</div>


## Section 2: Quantification

We can align our RNA-seq reads to a reference *genome*, and then overlap with know gene coordinates, but many prefer to align directly to the *transcriptome* sequences using a method such as salmon or kallisto. We will demonstrate the salmon protocol.

### Obtaining the files for salmon

1) cDNA fasta file

This file has been provided in the google drive folder and should have been uploaded to your Galaxy history.

- [https://drive.google.com/drive/folders/1RSuvl9shAw12Bj77uYSUdWtkZ5ST5EWi?usp=sharing](https://drive.google.com/drive/folders/1RSuvl9shAw12Bj77uYSUdWtkZ5ST5EWi?usp=sharing)

However, it is useful to know where this file came from in case you are not working with Human data. The file was obtained from [Ensembl](http://m.ensembl.org/info/data/ftp/index.html) by clicking on the **cDNA (FASTA)** link for the appropriate organism (Human). 

<img src="media/ensembl_download.png"/>

On the next screen,  Right-click to save the `.cdna.all.fa.gz` to your computer

<img src="media/cdna_download.png"/>

2) Transcript mapping file

By default, salmon will produce counts for each *transcript*. This might be what we want, but for most standard analyses it is preferable to work at the gene-level. We therefore have to tell salmon how the transcripts in the cDNA file relate to known genes. Such a file can be obtained from [biomart](https://ensembl.org/biomart/martview/). 

- Select the Ensembl genes (104) database
- Select the dataset Human genes (GRCh38.13)
- In Attributes, click the "+" button next to GENE and select *Transcript stable ID version* and *Gene stable ID*. Make sure the order on the left-hand panel is *Transcript stable ID version*, followed by *Gene stable ID*. This will affect the column order in the file. You can tick / untick the IDs to make sure the order is correct.

<img src="media/biomart_transcript_export.PNG"/>

- Click the Results button in the top left corner to see a preview of the results. Clicking the Go button will export the results to a file

<img src="media/biomart_transcript_preview.PNG"/>


**It is important that the file downloaded from Biomart is edited so that the column headings do not contain any spaces**. You can do this in a text editor. The edited file should look like this.

<img src="media/biomart_transcript_edit.PNG"/>

<div class="warning">

It is important to make sure the version number of your transcript file and the biomaRt dataset are **the same**, otherwise some of the steps downstream might not work as expected.
</div>

If you have problems, this mapping file is also provided in the google drive as `tx2gene.txt`.

3) Annotation file (optional)

The Ensembl gene IDs are not particularly memorable, so it would be highly beneficial to have other annotations at hand to help us interpret the data downstream. We can use the biomart website again to produce a table to downstream intrepretation. 

This time, select only the *Gene Stable ID* tickbox in the GENE box. Expand the EXTERNAL panel by clicking the "+" next to EXTERNAL, and select *HGNC symbol* and *NCBI gene (formerly Entrezgene) ID*


### salmon configuration and running

<div class="information">
**RNA Analysis** -> **Salmon quant**
</div>


- Select the *Homo_sapiens_GRCh38.cdna.all.fa.gz* file as the Transcripts fasta file
- Select all your uploaded fastq files as your Data Input FASTQ/FASTA file
- Scroll down to *File containing a mapping of transcripts to genes* and select the `tx2gene.txt` file

Two jobs will now be queued for each sample fastq file. The Quantification output will contain transcript-level data, and the Gene Quantification output will be at the *gene-level*. We should expect the number of lines in the Gene Quantification file to be substantially less. If not, you will need to check that your transcript mapping file was correct.

The Gene Quantification output from each sample comprises the following columns (taken from the [salmon documentation](https://salmon.readthedocs.io/en/latest/file_formats.html))

- Name — This is the name of the target transcript provided in the input transcript database (FASTA file).
- Length — This is the length of the target transcript in nucleotides.
- EffectiveLength — This is the computed effective length of the target transcript. It takes into account all factors being modeled that will effect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and gc-fragment bias (if they are being modeled).
- TPM — This is salmon’s estimate of the relative abundance of this transcript in units of Transcripts Per Million (TPM). TPM is the recommended relative abundance measure to use for downstream analysis.
- NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

Note that we are using a downsampled dataset, so the majority of NumReads will be zero.

### Create a count matrix

Methods for detecting differential expression are likely to want data in the form of a table; where every row is a different gene and each column is a unique biological sample. Before we can proceed we will therefore need to *merge* our salmon results into a single output. This can be down using the *Salmon quantmerge* tool

<div class="information">
**RNA Analysis** -> **Salmon quantmerge**
</div>

Use the +Insert Quant file and names button repeatedly to select each of your **Gene Quantification** outputs. The One-word sample names text box can be used to create a shorter column name for each output.

Once all the Gene Quantification files have been selected the drop-down menu under **Columns** should be changed from Length to **NumReads**.

After the tool has finished you should have a table with 

### Adding extra annotation to results

<div class="information">
**Text Manipulation** -> **Join two files**
</div>

- 1st file: *Column Join on data....*
- Column to use from 1st file: Column 1
- 2nd file: result from *annotateMyIDs on data...*
- Column to use from 2nd file: Column 1

## **(Optional)** Alternative workflow involving genome alignment

If time allows, we will also follow this section

![](https://databeauty.com/figures/2016-09-13-RNA-seq-analysis/rna_seq_workflow.png)

The workflow that people used for many years is summarised in this image from Ting-you Wang's [RNA-seq data analysis page](https://databeauty.com/blog/tutorial/2016/09/13/RNA-seq-analysis.html), and may still be preferable if your analysis doesn't just call for gene-level counts. 
<div class="information">

Mapping -> HISAT2

</div>

#### 1.  Map/align the reads with HISAT2 to the hg38 reference genome
In the left tool panel menu, under NGS Analysis, select
**Mapping > HISAT2** and set the parameters as follows:  

- **Is this single-end or paired-end data?** Single-end (as individual datasets)  
- **FASTQ file**  
(Click on the multiple datasets icon and select all four of the
FASTQ files)
    - `SRR7108388.fastq.gz`
    - `SRR7108389.fastq.gz`
    - `SRR7108392.fastq.gz`
    - `SRR7108393.fastq.gz`

- **Source for the reference genome** Use
built-in genome
- **Select a reference genome:** Human Dec 2013. (GRCh38/hg38) (hg38)
- Use defaults for the other fields
- Execute



### Quantification (Counting reads in features)

In order to test for differential expression, we need to count up how many times each "feature" is observed in each sample. The goal of such operations is to produce a *count table* such as that shown below. We can then apply statistical tests to these data

![](media/counts.png)

HTSeq-count creates a count matrix using the number of the reads from each bam
file that map to the genomic features. For each feature (a
gene for example) a count matrix shows how many reads were mapped to this
feature.

Various rules can be used to assign counts to features

![](media/htseq.png)

To obtain the coordinates of each gene, we can use the UCSC genome browser which is integrated into Galaxy.

### Obtaining gene coordinates

[Ensembl](http://m.ensembl.org/info/data/ftp/index.html) 

### Counting reads in genes

<div class="information">
**RNA Analysis > htseq-count**
</div>

1.  Use HTSeq-count to count the number of reads for each feature.  
    In the left tool panel menu, under NGS Analysis, select
    **NGS Analysis > htseq-count** and set the parameters as follows:  
    - **Aligned SAM/BAM file**  
      (Select one of four bam files, or all four using the multiple datasets option)
    - **GFF file** UCSC Main on Mouse:ncbiRefSeq (genome)
    - Use defaults for the other fields
    - Execute
2.  Repeat for the remaining bam files if running on each bam separately.
3.  To make things easier to track, rename the ht-seq output for each sample to contain the corresponding sample name (e.g. SRR1552444.htseq). **Do not rename the outputs that have "(no feature)" in their name**

### Create a count matrix

The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools such as Degust (see next section) which require the counts to be presented in a data matrix where each row is a gene and each column is a particular sample in the dataset.

<div class="information">
*Collection Operations -> Column Join* on Collections
</div>

- In the *Tabular Files* section, select the `ht-seq` count files from your history *SRR1552444.htseq*, *SRR1552450*, etc... Holding the CTRL key allows multiple files to be selected
- Keep *Identifier column* as `1`




# Differential Expression using Degust

Differential expression is possible using Galaxy using the DESeq2 tool (for example). However, our particular recommendation is to use Degust for a more interactive experience. For this section, we will be using counts generated on the *full dataset*, rather than the *downsampled* data analysed in the previous section. These counts are available in the file `GSE114013_salmon_counts.csv`.


## Differential expression

The term *differential expression* was first used to refer to the process of finding statistically significant genes from a *microarray* gene expression study.

![](media/de_explained.png)


Such methods were developed on the premise that microarray expression values are approximately *normally-distributed* when appropriately transformed (e.g. by using a log$_2$ transformation) so that a modified version of the standard *t-test* can be used. The same test is applied to each gene under investigation yielding a *test statistic*, *fold-change* and *p-value*. Similar methods have been adapted to RNA-seq data to account for the fact that the data are *count-based* and do not follow a normal distribution.


![](media/rna_advanced_degust_1.png)

<font size="8">[http://degust.erc.monash.edu/](http://degust.erc.monash.edu/)</font>

`Degust` is a web tool that can analyse the counts files produced in the step above, to test for differential gene expression. It offers and interactive view of the differential expression results

The input file is a count matrix where each row is a measured gene, and each column is a different biological sample. Within the tool we can configure which samples belong to the different biological groups of interest.

Read counts have to be normalised first prior to differential expression testing. There are two main biases that need to be accounted for:-

- size of gene
    + *longer* genes will have more reads assigned to them
- library size
    + a sample that is sequenced to a higher depth will receive more reads
  
However, R-based methods such as `edgeR` (implemented in Degust) and `DESeq2` have their own method of normalising counts. You will probably encounter other methods of normalising RNA-seq reads such as *RPKM*, *CPM*, *TPM* etc. [This blog](https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/) provides a nice explanation of the current thinking. As part of the `Degust` output, you have the option of downloading normalised counts in various formats. Some other online visualisation tools require normalised counts as input, so it is good to have these to-hand.


### Uploading the count matrix to Degust


- From the main degust page, click *Upload your counts file*
- Click on Browse
- Select the location of the file `GSE114013_salmon_counts.csv`, and click *Open*.
- Click *Upload*
- A Configuration page will appear.

![](media/degust_config.png)

- For Name type "*GSE114013*" (or whatever you want to call the analysis)
- For Info columns select *SYMBOL*
- Click Add condition
    + Referring to the experiment design (below), select the DMSO samples and call the condition DMSO
    + Repeat for the ITRACONAZOLE samples
- Save the settings and then View the results

run	| name	| cell_line	| condition
----|-------|-----------|--------- 
SRR7108388	| HT55_CONT_1	| HT55	| DMSO
SRR7108389	| HT55_CONT_2	| HT55	| DMSO
SRR7108390	| HT55_CONT_3	| HT55	| DMSO
SRR7108391	| HT55_CONT_4	| HT55	| DMSO
SRR7108392	| HT55_ITRA_1	| HT55	| ITRACONAZOLE
SRR7108393	| HT55_ITRA_2	| HT55	| ITRACONAZOLE
SRR7108394	| HT55_ITRA_3	| HT55	| ITRACONAZOLE
SRR7108395	| HT55_ITRA_4	| HT55	| ITRACONAZOLE
SRR7108396	| SW948_CONT_1 | SW948	| DMSO
SRR7108397	| SW948_CONT_2 | SW948	| DMSO
SRR7108398	| SW948_CONT_3 | SW948	| DMSO
SRR7108399	| SW948_CONT_4 | SW948	| DMSO
SRR7108400	| SW948_ITRA_1 | SW948	| ITRACONAZOLE
SRR7108401	| SW948_ITRA_2 | SW948	| ITRACONAZOLE
SRR7108402	| SW948_ITRA_3 | SW948	| ITRACONAZOLE
SRR7108403	| SW948_ITRA_4 | SW948	| ITRACONAZOLE



### Overview of Degust sections

- Top black panel with Configure settings at right.
- Left: Conditions: DMSO and ITRACONAZOLE.
- Top centre: Plots, with options at right.
- When either of the expression plots are selected, a heatmap appears below.
- A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).

(**Not that the screenshots are for illustration purposes and taken from a different dataset to that being analysed in the tutorial**)

![](http://sepsis-omics.github.io/tutorials/modules/dge/images/image12.png)


### MA-plot

![](media/degust_ma.png)

Each dot shows the change in expression in one gene.

- The average expression (over both condition and treatment samples) is represented on the x-axis.
    + Plot points should be symmetrical around the x-axis.
    + We can see that many genes are expressed at a low level, and some are highly expressed.
- The fold change is represented on the y axis.
    + If expression is significantly different between batch and chem, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
    + At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.

Click on the dot to see the gene name.

### Parallel coordinates and heatmap

![](media/degust_parallel_heatmap.png)

Each line shows the change in expression in one gene, between control and treatment.

- Go to Options at the right.
    + For FDR cut-off set at 0.001.
    + This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
- Look at the Parallel Coordinates plot. There are two axes:
    + Left: Control: Gene expression in the control samples. All values are set at zero.
    + Right: Treatment Gene expression in the treatment samples, relative to expression in the control.
- The blocks of blue and red underneath the plot are called a heatmap.
    + Each block is a gene. Click on a block to see its line in the plot above.
    + Look at the row for the chem. Relative to batch, genes expressed more are red; genes expressed less are blue.

### Table of genes

![](media/degust_gene_table.png)

Table of genes

- gene_id: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
- FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
- basal and luminal: log2(Fold Change) of gene expression. The default display is of fold change in the treatment relative to the control. Therefore, values in the batch column are zero. This can be changed in the Options panel at the top right.
    + In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.

The table can be sorted according to any of the columns (e.g. fold-change or p-value)


### Download and R code

Above the genes table is the option to download the results of the current analysis to a csv file. You can also download the *R* code required to reproduce the analysis by clicking the *Show R code* box underneath the Options box.

Plots such as the MDS, MA and heatmap can also be exported by right-clicking on the plot.


### MDS plot

This is a multidimensional scaling plot which represents the variation between samples. It is a similar concept to a Principal Components Analysis (PCA) plot. The x-axis is the dimension with the highest magnitude. In a standard control/treatment setup, samples should be split along this axis. A desirable plot is shown below:-

![](media/degust_mds.png)

### Exercise

<div class="exercise">
**Question:** It seems that the differential expression analysis is detecting lots of genes. However, does this tell the whole story about the dataset? What do you think is the main factor separating samples on the x-axis, and thus explaining the most variation in the data?

</div>



## Modifying the analysis

We will now repeat the analysis, but only for samples from the *HT55* cell-line. The correct configuration for this analysis is shown below:-

![](media/degust_ht55.PNG)

<div class="exercise">
**Exercise**: How many genes are differentially-expressed with an FDR < 0.05 and abs logFC > 1. Download this file and rename it to `HT55.ITRACONAZOLE_vs_DMSO.csv`.
</div>

<div class="exercise">
**Exercise:** Rest the FDR cut-off and abs LogFC cutoffs back to default and *download* the file and rename to `background.csv`. We will use this later.
</div>

<div class="exercise">
**Exercise**: Repeat the analysis for SW948 samples and download the table of differentially-expressed results (same FDR and log fold-change) to `SW948.ITRACONAZOLE_vs_DMSO.csv`
</div>



### File Downloads

<div class="information">
If you didn't manage to complete these analyses, you can download the files from here by right-clicking on each link and selecting "Save Link as" (or equivalent). They are also available in the course google drive.

- [HT55.ITRACONAZOLE_vs_DMSO.csv](HT55.ITRACONAZOLE_vs_DMSO.csv)
- [SW948.ITRACONAZOLE_vs_DMSO.csv](SW948.ITRACONAZOLE_vs_DMSO.csv)
- [background.csv](background.csv)

</div>

### Overlapping Gene Lists

We might sometimes want to compare the lists of genes that we identify using different methods, or genes identified from more than one contrast. In our example dataset we can compare the genes in the contrast of ITRACONAZOLE vs DMSO in HT55 and SW948 cells

The website *venny* provides a really nice interface for doing this.

![](media/venny_config.png)

- Open both your *HT55: ITRACONAZOLE vs DMSO* and *SW948: ITRACONAZOLE vs DMSO* results files in Excel
- Go to the venny website
    + http://bioinfogp.cnb.csic.es/tools/venny/
- Copy the names of genes with adjusted p-value less than 0.05 in the HT55 analysis into the **List 1** box on the venny website. **List 1** can be renamed to *HT55*
    + *You can select all entries in a column with the shortcut CTRL + SPACE*
- Copy the names of genes with adjusted p-value less than 0.05 in the SW948 analysis into the **List 2** box on the venny website. **List 2** can be renamed to **SW948**
- venny should now report the number of genes found in each list, the size of the intersection, and genes unique to each method

### Refined analysis

The final analysis we will perform is to include all the samples, but correct for the differences in cell-line. This is achieved by telling Degust about the *hidden factors* in our dataset. The hidden factor in this dataset is whether the sample is from the **HT55** or **SW948** samples. However, we only need to specify which samples are from HT55. Other hidden factors you might need to include could be (depending on the MDS plot) :-

- sample batch
- gender

See below for the correct configuration to include the hidden factors.

<img src="media/hidden_factor.png"/>

# Enrichment and pathways analysis

In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.

In this section we move towards discovering if our results are ***biologically significant***. Are the genes that we have picked statistical flukes, or are there some commonalities. 

There are two different approaches one might use, and we will cover the theory behind both. The distinction is whether you are happy to use a hard (and arbitrary) threshold to identify DE genes.


## Over-representation analysis

"Threshold-based" methods require defintion of a statistical threshold to define list of genes to test (e.g. FDR < 0.01). Then a *hypergeometric* test or *Fisher's Exact* test generally used. They are typically used in situations where plenty of DE genes have been identified, and people often use quite relaxed criteria for identifying DE genes (e.g. raw rather than adjusted p-values or FDR value)

The question we are asking here is;

> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***

We can answer this question by knowing

- the total number of DE genes
- the number of genes in the gene set (pathway or process)
- the number of genes in the gene set that are found to be DE
- the total number of tested genes (background)

The formula for Fishers exact test is;

$$ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} $$

with:-

|              | is DE | Not DE | Row Total |
|------------- | ------|--------|-----------|
| In Gene Set   | a     | b      | a + b     |
| Not in Gene Set  | c     | d      | c + d     |
| Column Total    |  a + c | b + d  | a + b + c + d = n |


In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- http://www.geneontology.org/


## Using GOrilla

There are several popular online tools for performing enrichment analysis

We will be using the online tool [GOrilla](http://cbl-gorilla.cs.technion.ac.il/) to perform the pathways analysis as it is particularly straightforward. It has two modes; the first of which accepts a list of *background* and *target* genes. 

1. Go to http://cbl-gorilla.cs.technion.ac.il/
2. Choose Organism: `Homo Sapiens`
3. Choose running mode: `Two unranked lists of genes`
4. Paste the gene symbols corresponding to DE genes in *SW948: ITRACONAZOLE vs DMSO* into the Target set.
  + **The shortcut CTRL + SPACE will let you select an entire column**
5. Paste the gene symbols from the Background set into the other box. This should be the names of all genes present in the Background file. i.e. all genes that were tested.
6. Choose an Ontology: `Process`
7. Under advanced options, tick **Output results in Microsoft Excel format**
8. `Search Enriched GO terms`

You should be presented with a graph of enriched GO terms showing the relationship between the terms. Each GO term is coloured according to its statistical significance.

Below the figure is the results table. This links to more information about each GO term, and lists each gene in the category that was found in your list. The enrichment column gives 4 numbers that are used to determine enrichment (similar to the Fisher exact test we saw earlier)

- N, total number of genes (should be the same in all rows)
- B, total number of genes annotated with the GO term
- n, total number of genes that were found in the list you uploaded (same for all rows)
- b, number of genes in the list you uploaded that intersect with this GO term

<div class="exercise">
**Exercise:** Repeat the GOrilla to find enriched pathways in the HT55: ITRACONAZOLE vs DMSO analysis. What do you notice?
</div>


## Threshold-free analysis

This type of analysis is popular for datasets where differential expression analysis does not reveal many genes that are differentially-expressed on their own. Instead, it seeks to identify genes that as a group have a tendancy to be near the extremes of the log-fold changes. The results are typically presented in the following way.

![](media/overexpressed-gsea.png)

As such, it does not rely on having to impose arbitrary cut-offs on the data. Instead, we need to provide a measure of the importance of each gene such as it's fold-change. These are then used the rank the genes.

The Broad institute has made this analysis method popular and provides [a version of GSEA](http://software.broadinstitute.org/gsea/index.jsp) that can be run via a java application. However, the application can be a bit fiddly to run, so we will use the GeneTrail website instead

- Open the file `background.csv` in Excel and delete all columns except the `SYMBOL` and `ITRA` column.
<img src="media/GeneTrail_prep.png"/>
- Go to the GeneTrail website, and select Transcriptomics from the front page
- Select the **Paste the content of a text file in a tabular format option** and the contents of your modified excel file into the box. **Do not paste the column headings**
- Click Upload

Hopefully it should recognise your input without any errors, and on the next screen the **Set-level statistic** should be automatically set to **GSEA**

<div class="warning">
If your data does not get uploaded, double-check that the column heading **ITRA** has not been pasted into the text box
</div>

To make the analysis run fast, you can de-select the GO pathways (biological processes, molecular function and cellular compartment)

